Here is a fact every AI demo conveniently ignores: your LLM provider will go down. Not "might." Will. They rate-limit you mid-launch, deprecate the model your app depends on, raise prices overnight, and have regional outages. On the day that happens, a single-provider app is simply down — and you find out from your users.
An LLM gateway with a fallback chain is the unglamorous layer that fixes this. It routes each request down a priority list of providers until one succeeds, trips a circuit breaker on the ones that fail, and keeps your agent answering. This is how I built Agent-Routing, with the design distilled from 18 months of running it in production.
1. Fallback Chains Per Task Class
Not every task wants the same provider. Code generation, prose, and cheap classification have different best-fit models and different cost tolerances. So the router keys a separate priority chain to each task class:
code: openai -> gemini -> anthropic -> ollama
content: gemini -> openai -> anthropic -> ollama
ui: gemini -> openai -> ollama
simple: openai -> gemini -> ollama
The rule is simple: try the first provider; on failure or timeout, fall through to the next. No API key for a provider? It's skipped. All cloud providers down? A local Ollama model serves the request. The agent never fully stops. My portfolio's own assistant runs its chat chain the same way, trying free providers first (NVIDIA NIM → Groq → OpenCode Zen → OpenRouter) and holding a paid model back as the last-resort safety net.
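Here is a minimal TypeScript sketch of that rule. The Provider interface, the configured flag, and routeWithFallback are hypothetical names chosen for illustration, not Agent-Routing's actual API; only the chain order comes from the list above.

```typescript
// Hypothetical types for illustration; not Agent-Routing's real API.
type TaskClass = "code" | "content" | "ui" | "simple";

interface Provider {
  name: string;
  configured: boolean; // false when no API key (or local server) is set up
  complete(prompt: string): Promise<string>;
}

// Priority chains copied from the per-task list above.
const CHAINS: Record<TaskClass, string[]> = {
  code: ["openai", "gemini", "anthropic", "ollama"],
  content: ["gemini", "openai", "anthropic", "ollama"],
  ui: ["gemini", "openai", "ollama"],
  simple: ["openai", "gemini", "ollama"],
};

// Reject if the call has not settled within `ms` milliseconds.
function withTimeout<T>(call: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    call.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function routeWithFallback(
  task: TaskClass,
  prompt: string,
  providers: Map<string, Provider>,
  timeoutMs = 10000,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const name of CHAINS[task]) {
    const provider = providers.get(name);
    if (!provider || !provider.configured) continue; // not set up: skip it
    try {
      const text = await withTimeout(provider.complete(prompt), timeoutMs);
      return { provider: name, text };
    } catch (err) {
      // Failure or timeout: record it and fall through to the next provider.
      errors.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed for "${task}": ${errors.join("; ")}`);
}
```

The function only throws once the whole chain is exhausted, so keeping a local provider at the tail of every chain is what makes "never fully stops" hold in practice.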
2. The Session Circuit Breaker
Naive fallback has a hidden cost: if a provider is down, every single request still tries it first, waits for the timeout, and only then falls through. Under load that's thousands of wasted, slow calls. The fix is a circuit breaker — once a provider fails all its models, mark it OPEN for the rest of the session and skip it entirely:
CIRCUIT OPENED: openai failed all models and is marked down for this session.
This is the difference between "degrades gracefully" and "hangs for everyone." The first failure pays the timeout cost; every request after it routes straight to a healthy provider.
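A session-scoped breaker can be as small as a set of provider names. The sketch below assumes each provider exposes a list of models and a call function; SessionCircuitBreaker, ProviderModels, and chatWithBreaker are illustrative names, and only the log message is taken from the example above.

```typescript
// Hypothetical sketch; class and function names are illustrative assumptions.
class SessionCircuitBreaker {
  private readonly open = new Set<string>();
  readonly log: string[] = [];

  isOpen(provider: string): boolean {
    return this.open.has(provider);
  }

  // Call once a provider has failed every model it offers.
  trip(provider: string): void {
    if (this.open.has(provider)) return;
    this.open.add(provider);
    this.log.push(
      `CIRCUIT OPENED: ${provider} failed all models and is marked down for this session.`,
    );
  }
}

interface ProviderModels {
  name: string;
  models: string[];
  call(model: string, prompt: string): Promise<string>;
}

async function chatWithBreaker(
  chain: ProviderModels[],
  prompt: string,
  breaker: SessionCircuitBreaker,
): Promise<string> {
  for (const provider of chain) {
    if (breaker.isOpen(provider.name)) continue; // skip without paying the timeout
    for (const model of provider.models) {
      try {
        return await provider.call(model, prompt);
      } catch {
        // This model failed; try the provider's next model.
      }
    }
    breaker.trip(provider.name); // every model failed: open the circuit
  }
  throw new Error("no healthy provider left in the chain");
}
```

Because the breaker never closes again within a session, a provider that recovers stays skipped until a new session starts. That trade is deliberate in this sketch: it gives up some capacity in exchange for never paying the timeout twice.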
3. Budget-Aware Routing
Reliability without cost control is how you wake up to a four-figure bill. The router can take a per-call budget and only route to models whose estimated cost fits:
await router.chat(prompt, system, 'code', 0.3, 4096, budget: 0.001);
// Only routes to models whose estimated cost fits within $0.001
Combined with a free-provider-first chain, this means the expensive model only ever runs when the cheap ones have genuinely failed — the cost ceiling and the reliability floor are the same mechanism.
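To make the budget check concrete, here is one way such a filter might work, sketched in TypeScript. Everything in it is an assumption for illustration: the ModelPrice shape, the four-characters-per-token heuristic, and the worst-case estimate that charges for the full output limit. Real prices belong in configuration and are not shown here.

```typescript
// Hypothetical price entry; real per-token prices would come from configuration.
interface ModelPrice {
  provider: string;
  model: string;
  inputUsdPerMTok: number;  // USD per million input tokens
  outputUsdPerMTok: number; // USD per million output tokens
}

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Worst-case estimate: assumes the model uses its whole output allowance.
function estimateCostUsd(m: ModelPrice, prompt: string, maxOutputTokens: number): number {
  const inputCost = (estimateTokens(prompt) / 1e6) * m.inputUsdPerMTok;
  const outputCost = (maxOutputTokens / 1e6) * m.outputUsdPerMTok;
  return inputCost + outputCost;
}

// Keep chain order, drop every model whose estimate exceeds the budget.
function withinBudget(
  chain: ModelPrice[],
  prompt: string,
  maxOutputTokens: number,
  budgetUsd?: number,
): ModelPrice[] {
  if (budgetUsd === undefined) return chain; // no budget given: no filtering
  return chain.filter((m) => estimateCostUsd(m, prompt, maxOutputTokens) <= budgetUsd);
}
```

Filtering before the fallback loop, rather than inside it, keeps the two concerns separate: the budget decides which models are eligible, and the chain order decides which eligible model is tried first.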
4. Token Optimization (Free Savings on Every Call)
Before a prompt is sent, the router compresses it: filler words and phrases removed, whitespace collapsed. On verbose prompts that's a routine 10–20% reduction in input tokens, and since you pay per input token on every provider in the chain, the saving repeats on every fallback attempt.
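A compression pass of this kind can be a handful of regular expressions. The TypeScript sketch below is an assumption about what one might look like: the filler list is a tiny illustrative sample, and compressPrompt is a hypothetical name.

```typescript
// Illustrative filler list (assumption); a real list would be tuned on real prompts.
const REPLACEMENTS: Array<[RegExp, string]> = [
  [/\bin order to\b/gi, "to"],
  [/\bdue to the fact that\b/gi, "because"],
  [/\b(?:please|kindly|basically|actually|really|just)\b/gi, ""],
];

function compressPrompt(prompt: string): string {
  let out = prompt;
  for (const [pattern, replacement] of REPLACEMENTS) {
    out = out.replace(pattern, replacement);
  }
  return out
    .replace(/[ \t]+/g, " ")        // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n")       // trim spaces around newlines
    .replace(/\n{3,}/g, "\n\n")     // keep at most one blank line
    .replace(/ +([,.;:!?])/g, "$1") // no space before punctuation
    .trim();
}
```

A pass this blunt will also rewrite code snippets and quoted text embedded in a prompt, and dropping a word like "just" can occasionally change meaning, so in practice it is worth restricting it to the natural-language parts of the prompt.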
5. Guardrails: The Gateway Is Also a Checkpoint
A single choke point for all LLM traffic is the perfect place to enforce safety. The router checks input for prompt injection before it ever reaches a model, and scans output to redact leaked secrets — API keys, JWTs, database connection strings — before returning:
- Input: prompt-injection detection rejects obvious attacks at the door.
- Output: secret-pattern redaction stops a model from echoing a credential into a response.
You can't bolt this on at 40 call sites. You get it for free when every call goes through one gateway.
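Both checks can be sketched with plain pattern matching. The TypeScript below is a deliberately small illustration, not a complete defense: the injection phrases, the secret patterns (an sk- style key prefix, the three-segment JWT shape, common database URL schemes), and the function names are all assumptions chosen for the example.

```typescript
// Illustrative patterns only (assumption); real attacks are far more varied.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (?:all |any )?(?:previous|prior|above) instructions/i,
  /disregard (?:the |your )?system prompt/i,
  /reveal (?:the |your )?system prompt/i,
];

// Input side: flag obvious prompt-injection phrasing before calling a model.
function looksLikeInjection(input: string): boolean {
  return INJECTION_PATTERNS.some((pattern) => pattern.test(input));
}

// Output side: label plus pattern for each secret shape to redact.
const SECRET_PATTERNS: Array<[string, RegExp]> = [
  // JWT: three base64url segments, the first starting with "eyJ".
  ["JWT", /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g],
  // API keys with an "sk-" prefix followed by a long token.
  ["API_KEY", /\bsk-[A-Za-z0-9_-]{20,}/g],
  // Database connection strings for a few common schemes.
  ["DB_URL", /\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?):\/\/[^\s"']+/g],
];

function redactSecrets(output: string): string {
  let out = output;
  for (const [label, pattern] of SECRET_PATTERNS) {
    out = out.replace(pattern, `[REDACTED_${label}]`);
  }
  return out;
}
```

In a gateway, looksLikeInjection would run before the fallback chain is entered and redactSecrets on whatever the winning provider returns, so neither check has to know which provider served the request.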
What I Built
Agent-Routing implements all five: per-task fallback chains, session circuit breakers, budget-aware routing, token optimization, and injection/secret guardrails. The demo runs in mock mode with no API keys. The same router powers the grounded assistant on this site: see how I built the chatbot to watch it running live, and the concepts behind multi-agent systems for where routing fits in the bigger picture.
If you're tired of AI hype, this is the kind of boring layer that actually decides whether your app survives contact with production.